Abstract

We consider infinite-horizon stationary $\gamma$-discounted Markov Decision Processes,
for which it is known that there exists a stationary optimal policy. Using Value
and Policy Iteration with some error $\epsilon$ at each iteration, it is well-known that one
can compute stationary policies that are $\frac{2\gamma}{(1-\gamma)^2}\epsilon$-optimal. After arguing that this
guarantee is tight, we develop variations of Value and Policy Iteration for computing
non-stationary policies that can be up to $\frac{2\gamma}{1-\gamma}\epsilon$-optimal, which constitutes a
significant improvement in the usual situation when $\gamma$ is close to 1. Surprisingly,
this shows that the problem of "computing near-optimal non-stationary policies"
is much simpler than that of "computing near-optimal stationary policies".



1 Introduction 

Given an infinite-horizon stationary $\gamma$-discounted Markov Decision Process [24, 4], we consider
approximate versions of the standard Dynamic Programming algorithms, Policy and Value Iteration,
that build sequences of value functions $v_k$ and policies $\pi_k$ as follows:

Approximate Value Iteration (AVI):
$$v_{k+1} \leftarrow T v_k + \epsilon_{k+1} \qquad (1)$$

Approximate Policy Iteration (API):
$$\begin{cases} v_k \leftarrow v_{\pi_k} + \epsilon_k \\ \pi_{k+1} \leftarrow \text{any element of } \mathcal{G}(v_k) \end{cases} \qquad (2)$$

where $v_0$ and $\pi_0$ are arbitrary, $T$ is the Bellman optimality operator, $v_{\pi_k}$ is the value of policy $\pi_k$
and $\mathcal{G}(v_k)$ is the set of policies that are greedy with respect to $v_k$. At each iteration $k$, the term
$\epsilon_k$ accounts for a possible approximation of the Bellman operator (for AVI) or the evaluation of $v_{\pi_k}$
(for API). Throughout the paper, we will assume that error terms $\epsilon_k$ satisfy for all $k$, $\|\epsilon_k\|_\infty \le \epsilon$
for some $\epsilon \ge 0$. Under this assumption, it is well-known that both algorithms share the following
performance bound (see [25, 11, 4] for AVI and [4] for API):

Theorem 1. For API (resp. AVI), the loss due to running policy $\pi_k$ (resp. any policy $\pi_k$ in $\mathcal{G}(v_{k-1})$)
instead of the optimal policy $\pi_*$ satisfies

$$\limsup_{k \to \infty} \|v_* - v_{\pi_k}\|_\infty \le \frac{2\gamma}{(1-\gamma)^2}\,\epsilon.$$



The constant $\frac{2\gamma}{(1-\gamma)^2}$ can be very big, in particular when $\gamma$ is close to 1, and consequently the above
bound is commonly believed to be conservative for practical applications. Interestingly, this very
constant $\frac{2\gamma}{(1-\gamma)^2}$ appears in many works analyzing AVI algorithms [25, 11, 27, 12, 13, 23, 7, 6, 20, 21,
22, 9], API algorithms [15, 19, 16, 1, 8, 18, 5, 17, 10, 3, 9, 2] and in one of their generalizations [26],
suggesting that it cannot be improved. Indeed, the bound (and the $\frac{2\gamma}{(1-\gamma)^2}$ constant) are tight for
API [4, Example 6.4], and we will show in Section 3 (to our knowledge, this has never been argued
in the literature) that it is also tight for AVI.
Even though the theory of optimal control states that there exists a stationary policy that is optimal,
the main contribution of our paper is to show that looking for a non-stationary policy (instead of a
stationary one) may lead to a much better performance bound. In Section 4, we will show how to
deduce such a non-stationary policy from a run of AVI. In Section 5, we will describe two original
policy iteration variations that compute non-stationary policies. For all these algorithms, we will
prove that we have a performance bound that can be reduced down to $\frac{2\gamma}{1-\gamma}\epsilon$. This is a factor $\frac{1}{1-\gamma}$
better than the standard bound of Theorem 1, which is significant when $\gamma$ is close to 1. Surprisingly,
this will show that the problem of "computing near-optimal non-stationary policies" is much simpler
than that of "computing near-optimal stationary policies". Before we present these contributions, the
next section begins by precisely describing our setting.



2 Background 

We consider an infinite-horizon discounted Markov Decision Process [24, 4] $(S, A, P, r, \gamma)$, where $S$
is a possibly infinite state space, $A$ is a finite action space, $P(ds'|s,a)$, for all $(s,a)$, is a probability
kernel on $S$, $r : S \times A \to \mathbb{R}$ is a reward function bounded in max-norm by $R_{\max}$, and $\gamma \in (0,1)$
is a discount factor. A stationary deterministic policy $\pi : S \to A$ maps states to actions. We write
$r_\pi(s) = r(s,\pi(s))$ and $P_\pi(ds'|s) = P(ds'|s,\pi(s))$ for the immediate reward and the stochastic
kernel associated to policy $\pi$. The value $v_\pi$ of a policy $\pi$ is a function mapping states to the expected
discounted sum of rewards received when following $\pi$ from any state: for all $s \in S$,

$$v_\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_\pi(s_t) \,\middle|\, s_0 = s,\ s_{t+1} \sim P_\pi(\cdot|s_t)\right].$$



The value $v_\pi$ is clearly bounded by $V_{\max} = R_{\max}/(1-\gamma)$. It is well-known that $v_\pi$ can be
characterized as the unique fixed point of the linear Bellman operator associated to a policy $\pi$:
$T_\pi : v \mapsto r_\pi + \gamma P_\pi v$. Similarly, the Bellman optimality operator $T : v \mapsto \max_\pi T_\pi v$ has as
unique fixed point the optimal value $v_* = \max_\pi v_\pi$. A policy $\pi$ is greedy w.r.t. a value function $v$
if $T_\pi v = Tv$; the set of such greedy policies is written $\mathcal{G}(v)$. Finally, a policy $\pi_*$ is optimal, with
value $v_{\pi_*} = v_*$, iff $\pi_* \in \mathcal{G}(v_*)$, or equivalently $T_{\pi_*} v_* = v_*$.
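As a concrete illustration of these operators, the following minimal Python sketch builds a small finite MDP and checks numerically that a policy that is greedy with respect to $v_*$ has value $v_*$. The random MDP, the helper names (`make_mdp`, `q_value`, `bellman_pi`, `bellman_opt`, `greedy`, `fixed_point`) and the numerical values are illustrative assumptions, not part of the paper; fixed points are only approximated by repeated application of the operators.

```python
import random

def make_mdp(n_states, n_actions, seed=0):
    """Hypothetical random finite MDP: P[s][a] is a distribution over next states, R[s][a] in [0, 1]."""
    rng = random.Random(seed)
    P, R = [], []
    for _ in range(n_states):
        rows, rewards = [], []
        for _ in range(n_actions):
            w = [rng.random() + 1e-3 for _ in range(n_states)]
            total = sum(w)
            rows.append([x / total for x in w])
            rewards.append(rng.random())
        P.append(rows)
        R.append(rewards)
    return P, R

def q_value(P, R, gamma, v, s, a):
    """r(s, a) + gamma * E[v(s') | s, a]."""
    return R[s][a] + gamma * sum(p * x for p, x in zip(P[s][a], v))

def bellman_pi(P, R, gamma, pi, v):
    """T_pi v = r_pi + gamma * P_pi v."""
    return [q_value(P, R, gamma, v, s, pi[s]) for s in range(len(v))]

def bellman_opt(P, R, gamma, v):
    """T v: maximum over actions of the one-step backup."""
    return [max(q_value(P, R, gamma, v, s, a) for a in range(len(R[s])))
            for s in range(len(v))]

def greedy(P, R, gamma, v):
    """One element of G(v): an action attaining the maximum in every state."""
    return [max(range(len(R[s])), key=lambda a: q_value(P, R, gamma, v, s, a))
            for s in range(len(v))]

def fixed_point(op, n_states, iters=2000):
    """Approximate the unique fixed point of a gamma-contraction by iterating it."""
    v = [0.0] * n_states
    for _ in range(iters):
        v = op(v)
    return v

gamma = 0.9
P, R = make_mdp(n_states=4, n_actions=3)
v_star = fixed_point(lambda v: bellman_opt(P, R, gamma, v), 4)
pi_star = greedy(P, R, gamma, v_star)
v_pi_star = fixed_point(lambda v: bellman_pi(P, R, gamma, pi_star, v), 4)

# Sup-norm gap between v_* and the value of its greedy policy (numerically zero).
gap = max(abs(a - b) for a, b in zip(v_star, v_pi_star))
# Residual of the fixed-point equation T v_* = v_*.
residual = max(abs(a - b) for a, b in zip(bellman_opt(P, R, gamma, v_star), v_star))
```

Iterating the operators is used here only because it mirrors the fixed-point characterization in the text; for a finite MDP, $v_\pi$ could equally be obtained by solving the linear system $(I - \gamma P_\pi) v = r_\pi$.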

Though it is known [24, 4] that there always exists a deterministic stationary policy that is optimal,
we will, in this article, consider non-stationary policies and now introduce related notations. Given
a sequence $\pi_1, \pi_2, \ldots, \pi_k$ of $k$ stationary policies (this sequence will be clear in the context we
describe later), and for any $1 \le m \le k$, we will denote $\pi_{k,m}$ the periodic non-stationary policy
that takes the first action according to $\pi_k$, the second according to $\pi_{k-1}$, ..., the $m$-th according to
$\pi_{k-m+1}$ and then starts again. Formally, this can be written as

$$\pi_{k,m} = \pi_k\ \pi_{k-1} \cdots \pi_{k-m+1}\ \pi_k\ \pi_{k-1} \cdots \pi_{k-m+1} \cdots$$

It is straightforward to show that the value $v_{\pi_{k,m}}$ of this periodic non-stationary policy $\pi_{k,m}$ is the
unique fixed point of the following operator:

$$T_{k,m} = T_{\pi_k} T_{\pi_{k-1}} \cdots T_{\pi_{k-m+1}}.$$

Finally, it will be convenient to introduce the following discounted kernel:

$$\Gamma_{k,m} = (\gamma P_{\pi_k})(\gamma P_{\pi_{k-1}}) \cdots (\gamma P_{\pi_{k-m+1}}).$$

In particular, for any pair of values $v$ and $v'$, it can easily be seen that $T_{k,m} v - T_{k,m} v' = \Gamma_{k,m}(v - v')$.
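To make the fixed-point characterization of $v_{\pi_{k,m}}$ concrete, here is a small Python sketch, under the assumption of a finite random MDP, that computes the value of a periodic non-stationary policy in two ways: by iterating $T_{k,m}$, and by directly accumulating discounted rewards along the periodic action schedule. The helper names (`make_mdp`, `bellman_pi`, `periodic_value`, `rollout_value`) and the randomly drawn policies are hypothetical.

```python
import random

def make_mdp(n_states, n_actions, seed=1):
    """Hypothetical random finite MDP: P[s][a] is a distribution over next states, R[s][a] in [0, 1]."""
    rng = random.Random(seed)
    P, R = [], []
    for _ in range(n_states):
        rows, rewards = [], []
        for _ in range(n_actions):
            w = [rng.random() + 1e-3 for _ in range(n_states)]
            total = sum(w)
            rows.append([x / total for x in w])
            rewards.append(rng.random())
        P.append(rows)
        R.append(rewards)
    return P, R

def bellman_pi(P, R, gamma, pi, v):
    """T_pi v = r_pi + gamma * P_pi v."""
    return [R[s][pi[s]] + gamma * sum(p * x for p, x in zip(P[s][pi[s]], v))
            for s in range(len(v))]

def periodic_value(P, R, gamma, policies, m, sweeps=400):
    """Approximate fixed point of T_{k,m}, where policies = [pi_1, ..., pi_k]."""
    v = [0.0] * len(P)
    last_m = policies[-m:]                 # pi_{k-m+1}, ..., pi_k
    for _ in range(sweeps):
        for pi in last_m:                  # the rightmost operator T_{pi_{k-m+1}} acts first
            v = bellman_pi(P, R, gamma, pi, v)
    return v

def rollout_value(P, R, gamma, policies, m, s0, horizon=600):
    """Truncated discounted return from s0, propagating the state distribution exactly."""
    n = len(P)
    order = policies[-m:][::-1]            # actions follow pi_k, pi_{k-1}, ..., pi_{k-m+1}, then repeat
    dist = [0.0] * n
    dist[s0] = 1.0
    total, discount = 0.0, 1.0
    for t in range(horizon):
        pi = order[t % m]
        total += discount * sum(dist[s] * R[s][pi[s]] for s in range(n))
        dist = [sum(dist[s] * P[s][pi[s]][s2] for s in range(n)) for s2 in range(n)]
        discount *= gamma
    return total

gamma, n_states, n_actions = 0.9, 4, 3
P, R = make_mdp(n_states, n_actions)
rng = random.Random(2)
policies = [[rng.randrange(n_actions) for _ in range(n_states)] for _ in range(5)]  # pi_1..pi_5
v_fp = periodic_value(P, R, gamma, policies, m=3)
v_ro = [rollout_value(P, R, gamma, policies, 3, s) for s in range(n_states)]
max_diff = max(abs(a - b) for a, b in zip(v_fp, v_ro))
```

The two computations should agree up to truncation and rounding error, which reflects the fact that one application of $T_{k,m}$ corresponds to one full period of the policy.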



3 Tightness of the performance bound of Theorem 1

The bound of Theorem 1 is tight for API in the sense that there exists an MDP [4, Example 6.4]
for which the bound is reached. To the best of our knowledge, a similar argument has never been
provided for AVI in the literature. It turns out that the MDP that is used for showing the tightness
for API also applies to AVI. This is what we show in this section.

Example 1. Consider the $\gamma$-discounted deterministic MDP from [4, Example 6.4] depicted in Figure 1.
It involves states $1, 2, \ldots$. In state 1 there is only one self-loop action with zero reward; for
each state $i > 1$ there are two possible choices: either move to state $i-1$ with zero reward or stay
with reward $r_i = -2\frac{\gamma-\gamma^i}{1-\gamma}\epsilon$ with $\epsilon \ge 0$. Clearly the optimal policy in all states $i > 1$ is to move to
$i-1$ and the optimal value function $v_*$ is 0 in all states.

Figure 1: The deterministic MDP for which the bound of Theorem 1 is tight for Value and Policy
Iteration. The stay rewards shown on the self-loops are $-2\gamma\epsilon$, $-2(\gamma+\gamma^2)\epsilon$, ..., $-2\frac{\gamma-\gamma^k}{1-\gamma}\epsilon$.

Starting with $v_0 = v_*$, we are going to show that for all iterations $k \ge 1$ it is possible to have a
policy $\pi_{k+1} \in \mathcal{G}(v_k)$ which moves in every state but $k+1$ and thus is such that
$v_{\pi_{k+1}}(k+1) = \frac{r_{k+1}}{1-\gamma} = -2\frac{\gamma-\gamma^{k+1}}{(1-\gamma)^2}\epsilon$, which meets the bound of Theorem 1 when $k$ tends to infinity.

To do so, we assume that the following approximation errors are made at each iteration $k > 0$:

$$\epsilon_k(i) = \begin{cases} -\epsilon & \text{if } i = k \\ \epsilon & \text{if } i = k+1 \\ 0 & \text{otherwise.} \end{cases}$$

With this error, we are now going to prove by induction on $k$ that for all $k \ge 1$,

$$v_k(i) = \begin{cases} -\gamma^{k-1}\epsilon & \text{if } i < k \\ r_k/2 - \epsilon & \text{if } i = k \\ -(r_k/2 - \epsilon) & \text{if } i = k+1 \\ 0 & \text{otherwise.} \end{cases}$$

Since $v_0 = 0$, the best action is clearly to move in every state $i \ge 2$, which gives $v_1 = v_0 + \epsilon_1 = \epsilon_1$,
which establishes the claim for $k = 1$.

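Assuming that our induction claim holds for $k$, we now show that it also holds for $k+1$.

For the move action, write $q^m_k$ its action-value function. For all $i > 1$ we have $q^m_k(i) = 0 + \gamma v_k(i-1)$,
hence

$$q^m_k(i) = \begin{cases} \gamma(-\gamma^{k-1}\epsilon) = -\gamma^k\epsilon & \text{if } i = 2, \ldots, k \\ \gamma(r_k/2 - \epsilon) = r_{k+1}/2 & \text{if } i = k+1 \\ -\gamma(r_k/2 - \epsilon) = -r_{k+1}/2 & \text{if } i = k+2 \\ 0 & \text{otherwise.} \end{cases}$$

For the stay action, write $q^s_k$ its action-value function. For all $i \ge 1$ we have $q^s_k(i) = r_i + \gamma v_k(i)$,
hence

$$q^s_k(i) = \begin{cases} r_i + \gamma(-\gamma^{k-1}\epsilon) = r_i - \gamma^k\epsilon & \text{if } i = 1, \ldots, k-1 \\ r_k + \gamma(r_k/2 - \epsilon) = r_k + r_{k+1}/2 & \text{if } i = k \\ r_{k+1} - r_{k+1}/2 = r_{k+1}/2 & \text{if } i = k+1 \\ r_{k+2} + \gamma 0 = r_{k+2} & \text{if } i = k+2 \\ r_i & \text{otherwise.} \end{cases}$$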

First, only the stay action is available in state 1, hence, since $r_1 = 0$ and $\epsilon_{k+1}(1) = 0$, we have
$v_{k+1}(1) = q^s_k(1) + \epsilon_{k+1}(1) = -\gamma^k\epsilon$, as desired. Second, since $r_i < 0$ for all $i > 1$ we have
$q^m_k(i) > q^s_k(i)$ for all these states but $k+1$ where $q^m_k(k+1) = q^s_k(k+1) = r_{k+1}/2$. Using the fact
that $v_{k+1} = \max(q^m_k, q^s_k) + \epsilon_{k+1}$ gives the result for $v_{k+1}$.

The fact that for $i > 1$ we have $q^m_k(i) \ge q^s_k(i)$ with equality only at $i = k+1$ implies that there exists
a policy $\pi_{k+1}$ greedy for $v_k$ which takes the optimal move action in all states but $k+1$ where the
stay action has the same value, leaving the algorithm the possibility of choosing the suboptimal stay
action in this state, yielding a value $v_{\pi_{k+1}}(k+1)$, matching the upper bound as $k$ goes to infinity.

Since Example 1 shows that the bound of Theorem 1 is tight, improving performance bounds requires
modifying the algorithms. The following sections of the paper show that considering non-stationary
policies instead of stationary policies is an interesting path to follow.
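The induction of Example 1 can be checked numerically. The sketch below, a minimal Python illustration under the assumption that the chain is truncated to finitely many states (which is harmless here, since transitions only go towards state 1), runs AVI from $v_0 = 0$ with the errors $\epsilon_k$ of the example and compares each $v_k$ with the closed form claimed above. The function names and the values $\gamma = 0.9$, $\epsilon = 0.1$ and 30 iterations are arbitrary choices, not taken from the paper.

```python
def stay_reward(i, gamma, eps):
    """r_i = -2 (gamma - gamma^i) / (1 - gamma) * eps; note that r_1 = 0."""
    return -2.0 * (gamma - gamma ** i) / (1.0 - gamma) * eps

def claimed_value(k, i, gamma, eps):
    """Closed form for v_k(i) stated in Example 1."""
    half = stay_reward(k, gamma, eps) / 2.0
    if i < k:
        return -gamma ** (k - 1) * eps
    if i == k:
        return half - eps
    if i == k + 1:
        return -(half - eps)
    return 0.0

def run_example(gamma=0.9, eps=0.1, n_iters=30):
    n = n_iters + 3                        # truncation: states 1..n suffice for n_iters iterations
    v = [0.0] * (n + 1)                    # v[i] holds v_k(i); index 0 is unused; v_0 = v_* = 0
    max_formula_err, max_tie_gap = 0.0, 0.0
    for k in range(1, n_iters + 1):
        new_v = [0.0] * (n + 1)
        for i in range(1, n + 1):
            q_stay = stay_reward(i, gamma, eps) + gamma * v[i]
            q_move = gamma * v[i - 1] if i > 1 else q_stay   # state 1 only has the self-loop
            err = -eps if i == k else (eps if i == k + 1 else 0.0)
            new_v[i] = max(q_stay, q_move) + err             # v_k = T v_{k-1} + eps_k
        v = new_v
        for i in range(1, n + 1):
            max_formula_err = max(max_formula_err,
                                  abs(v[i] - claimed_value(k, i, gamma, eps)))
        # In state k+1, stay and move have the same action value with respect to v_k.
        q_stay = stay_reward(k + 1, gamma, eps) + gamma * v[k + 1]
        q_move = gamma * v[k]
        max_tie_gap = max(max_tie_gap, abs(q_stay - q_move))
    # A greedy policy may stay forever in state n_iters + 1; its value there is r / (1 - gamma),
    # while v_* = 0, so the loss is -r / (1 - gamma).
    loss = -stay_reward(n_iters + 1, gamma, eps) / (1.0 - gamma)
    bound = 2.0 * gamma / (1.0 - gamma) ** 2 * eps
    return max_formula_err, max_tie_gap, loss, bound

formula_err, tie_gap, loss, bound = run_example()
```

With these settings the ratio `loss / bound` equals $1 - \gamma^{30}$, so the loss approaches the bound of Theorem 1 geometrically fast in the number of iterations.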
4 Deducing a non-stationary policy from AVI

While AVI (Equation (1)) is usually considered as generating a sequence of values $v_0, v_1, \ldots, v_{k-1}$,
it also implicitly produces a sequence¹ of policies $\pi_1, \pi_2, \ldots, \pi_k$, where for $i = 0, \ldots, k-1$,
$\pi_{i+1} \in \mathcal{G}(v_i)$. Instead of outputting only the last policy $\pi_k$, we here simply propose to output the
periodic non-stationary policy $\pi_{k,m}$ that loops over the last $m$ generated policies. The following
theorem shows that it is indeed a good idea.

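Theorem 2. For all iterations $k$ and $m$ such that $1 \le m \le k$, the loss of running the non-stationary
policy $\pi_{k,m}$ instead of the optimal policy $\pi_*$ satisfies:

$$\|v_* - v_{\pi_{k,m}}\|_\infty \le \frac{2}{1-\gamma^m}\left(\frac{\gamma-\gamma^k}{1-\gamma}\,\epsilon + \gamma^k\|v_* - v_0\|_\infty\right).$$

When $m = 1$ and $k$ tends to infinity, one exactly recovers the result of Theorem 1. For general
$m$, this new bound is a factor $\frac{1-\gamma^m}{1-\gamma}$ better than the standard bound of Theorem 1. The choice that
optimizes the bound, $m = k$, and which consists in looping over all the policies generated from the
very start, leads to the following bound:

$$\|v_* - v_{\pi_{k,k}}\|_\infty \le 2\left(\frac{\gamma}{1-\gamma} - \frac{\gamma^k}{1-\gamma^k}\right)\epsilon + \frac{2\gamma^k}{1-\gamma^k}\|v_* - v_0\|_\infty,$$

that tends to $\frac{2\gamma}{1-\gamma}\epsilon$ when $k$ tends to $\infty$.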
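The following Python sketch illustrates the procedure on a small example: it runs AVI with bounded random errors, records the greedy policies, and compares the loss of the periodic policy $\pi_{k,m}$ with the bound of Theorem 2 for a few values of $m$. The random MDP, the uniform error model, all helper names and the parameter values are assumptions made for illustration; the sketch only checks that the observed losses respect the bound on this instance.

```python
import random

def make_mdp(n_states, n_actions, seed=0):
    """Hypothetical random finite MDP: P[s][a] is a distribution over next states, R[s][a] in [0, 1]."""
    rng = random.Random(seed)
    P, R = [], []
    for _ in range(n_states):
        rows, rewards = [], []
        for _ in range(n_actions):
            w = [rng.random() + 1e-3 for _ in range(n_states)]
            total = sum(w)
            rows.append([x / total for x in w])
            rewards.append(rng.random())
        P.append(rows)
        R.append(rewards)
    return P, R

def backup(P, R, gamma, v, s, a):
    return R[s][a] + gamma * sum(p * x for p, x in zip(P[s][a], v))

def bellman_opt(P, R, gamma, v):
    return [max(backup(P, R, gamma, v, s, a) for a in range(len(R[s])))
            for s in range(len(v))]

def greedy(P, R, gamma, v):
    return [max(range(len(R[s])), key=lambda a: backup(P, R, gamma, v, s, a))
            for s in range(len(v))]

def periodic_value(P, R, gamma, policies, m, sweeps=400):
    """Approximate value of pi_{k,m}: iterate T_{k,m} built from the last m policies."""
    v = [0.0] * len(P)
    for _ in range(sweeps):
        for pi in policies[-m:]:           # T_{pi_{k-m+1}} first, T_{pi_k} last
            v = [backup(P, R, gamma, v, s, pi[s]) for s in range(len(v))]
    return v

def noisy_avi(P, R, gamma, eps, k, seed=0):
    """Run k AVI iterations from v_0 = 0 with errors in [-eps, eps]; return [pi_1, ..., pi_k]."""
    rng = random.Random(seed)
    v = [0.0] * len(P)
    policies = []
    for _ in range(k):
        policies.append(greedy(P, R, gamma, v))              # pi_{i+1} in G(v_i)
        v = [x + rng.uniform(-eps, eps) for x in bellman_opt(P, R, gamma, v)]
    return policies

gamma, eps, k = 0.9, 0.05, 40
P, R = make_mdp(4, 3)
v_star = [0.0] * 4
for _ in range(2000):
    v_star = bellman_opt(P, R, gamma, v_star)
d0 = max(abs(x) for x in v_star)           # ||v_* - v_0||_inf with v_0 = 0
policies = noisy_avi(P, R, gamma, eps, k)

results = {}
for m in (1, 5, k):
    v_pol = periodic_value(P, R, gamma, policies, m)
    loss = max(abs(a - b) for a, b in zip(v_star, v_pol))
    bound = 2.0 / (1.0 - gamma ** m) * ((gamma - gamma ** k) / (1.0 - gamma) * eps
                                        + gamma ** k * d0)
    results[m] = (loss, bound)             # (observed loss, Theorem 2 bound)
```

Note that only the output changes with $m$: the AVI run itself is the same for all three policies, which is why the policies are computed once and reused.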

The rest of the section is devoted to the proof of Theorem 2. An important step of our proof lies
in the following lemma, that implies that for sufficiently big $m$, $v_k = Tv_{k-1} + \epsilon_k$ is a rather good
approximation (of the order $\frac{\epsilon}{1-\gamma}$) of the value $v_{\pi_{k,m}}$ of the non-stationary policy $\pi_{k,m}$ (whereas in
general, it is a much poorer approximation of the value $v_{\pi_k}$ of the last stationary policy $\pi_k$).

Lemma 1. For all $m$ and $k$ such that $1 \le m \le k$,

$$\|Tv_{k-1} - v_{\pi_{k,m}}\|_\infty \le \gamma^m\|v_{k-m} - v_{\pi_{k,m}}\|_\infty + \frac{\gamma-\gamma^m}{1-\gamma}\,\epsilon.$$

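Proof of Lemma 1. The value of $\pi_{k,m}$ satisfies:

$$v_{\pi_{k,m}} = T_{\pi_k} T_{\pi_{k-1}} \cdots T_{\pi_{k-m+1}} v_{\pi_{k,m}}. \qquad (3)$$

By induction, it can be shown that the sequence of values generated by AVI satisfies:

$$T_{\pi_k} v_{k-1} = T_{\pi_k} T_{\pi_{k-1}} \cdots T_{\pi_{k-m+1}} v_{k-m} + \sum_{i=1}^{m-1} \Gamma_{k,i}\,\epsilon_{k-i}. \qquad (4)$$

By subtracting Equation (3) from Equation (4), one obtains:

$$Tv_{k-1} - v_{\pi_{k,m}} = T_{\pi_k} v_{k-1} - v_{\pi_{k,m}} = \Gamma_{k,m}(v_{k-m} - v_{\pi_{k,m}}) + \sum_{i=1}^{m-1} \Gamma_{k,i}\,\epsilon_{k-i}$$

and the result follows by taking the norm and using the fact that for all $i$, $\|\Gamma_{k,i}\|_\infty = \gamma^i$. □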
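The induction behind Equation (4) is not spelled out in the text. One way to see its first step, using only $v_{k-1} = Tv_{k-2} + \epsilon_{k-1}$, the greediness of $\pi_{k-1}$ with respect to $v_{k-2}$ (so that $Tv_{k-2} = T_{\pi_{k-1}}v_{k-2}$), and the fact that $T_{\pi_k}$ is affine, is sketched below.

```latex
\begin{align*}
T_{\pi_k} v_{k-1}
  &= T_{\pi_k}\bigl(T v_{k-2} + \epsilon_{k-1}\bigr)
   = T_{\pi_k}\bigl(T_{\pi_{k-1}} v_{k-2} + \epsilon_{k-1}\bigr) \\
  &= T_{\pi_k} T_{\pi_{k-1}} v_{k-2} + \gamma P_{\pi_k}\,\epsilon_{k-1}
   = T_{\pi_k} T_{\pi_{k-1}} v_{k-2} + \Gamma_{k,1}\,\epsilon_{k-1}.
\end{align*}
```

Repeating the same substitution on $v_{k-2}, \ldots, v_{k-m+1}$ produces the remaining terms $\Gamma_{k,i}\,\epsilon_{k-i}$ for $i = 2, \ldots, m-1$.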
We are now ready to prove the main result of this section. 

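Proof of Theorem 2. Using the fact that $T$ is a contraction in max-norm, we have:

$$\begin{aligned}
\|v_* - v_k\|_\infty &= \|v_* - Tv_{k-1} - \epsilon_k\|_\infty \\
&\le \|Tv_* - Tv_{k-1}\|_\infty + \epsilon \\
&\le \gamma\|v_* - v_{k-1}\|_\infty + \epsilon.
\end{aligned}$$

Then, by induction on $k$, we have that for all $k \ge 1$,

$$\|v_* - v_k\|_\infty \le \gamma^k\|v_* - v_0\|_\infty + \frac{1-\gamma^k}{1-\gamma}\,\epsilon. \qquad (5)$$

Using Lemma 1 and Equation (5) twice, we can conclude by observing that

$$\begin{aligned}
\|v_* - v_{\pi_{k,m}}\|_\infty
&\le \|Tv_* - Tv_{k-1}\|_\infty + \|Tv_{k-1} - v_{\pi_{k,m}}\|_\infty \\
&\le \gamma\|v_* - v_{k-1}\|_\infty + \gamma^m\|v_{k-m} - v_{\pi_{k,m}}\|_\infty + \frac{\gamma-\gamma^m}{1-\gamma}\,\epsilon \\
&\le \gamma\left(\gamma^{k-1}\|v_* - v_0\|_\infty + \frac{1-\gamma^{k-1}}{1-\gamma}\,\epsilon\right)
   + \gamma^m\left(\|v_{k-m} - v_*\|_\infty + \|v_* - v_{\pi_{k,m}}\|_\infty\right) + \frac{\gamma-\gamma^m}{1-\gamma}\,\epsilon \\
&\le \gamma^k\|v_* - v_0\|_\infty + \frac{\gamma-\gamma^k}{1-\gamma}\,\epsilon
   + \gamma^m\left(\gamma^{k-m}\|v_* - v_0\|_\infty + \frac{1-\gamma^{k-m}}{1-\gamma}\,\epsilon + \|v_* - v_{\pi_{k,m}}\|_\infty\right) + \frac{\gamma-\gamma^m}{1-\gamma}\,\epsilon \\
&= \gamma^m\|v_* - v_{\pi_{k,m}}\|_\infty + 2\gamma^k\|v_* - v_0\|_\infty + \frac{2(\gamma-\gamma^k)}{1-\gamma}\,\epsilon.
\end{aligned}$$

Moving the term $\gamma^m\|v_* - v_{\pi_{k,m}}\|_\infty$ to the left-hand side and dividing by $1-\gamma^m$ gives

$$\|v_* - v_{\pi_{k,m}}\|_\infty \le \frac{2}{1-\gamma^m}\left(\frac{\gamma-\gamma^k}{1-\gamma}\,\epsilon + \gamma^k\|v_* - v_0\|_\infty\right). \qquad □$$

¹ A given sequence of value functions may induce many sequences of policies since more than one greedy
policy may exist for one particular value function. Our results hold for all such possible choices of greedy
policies.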



5 API algorithms for computing non-stationary policies 

We now present similar results that have a Policy Iteration flavour. Unlike in the previous section 
where only the output of AVI needed to be changed, improving the bound for an API-like algorithm 
is slightly more involved. In this section, we describe and analyze two API algorithms that output 
non-stationary policies with improved performance bounds. 

API with a non-stationary policy of growing period. Following our findings on non-stationary
policies for AVI, we consider the following variation of API, where at each iteration, instead of computing
the value of the last stationary policy $\pi_k$, we compute that of the periodic non-stationary policy
$\pi_{k,k}$ that loops over all the policies $\pi_1, \ldots, \pi_k$ generated from the very start:

$$\begin{cases} v_k \leftarrow v_{\pi_{k,k}} + \epsilon_k \\ \pi_{k+1} \leftarrow \text{any element of } \mathcal{G}(v_k) \end{cases}$$

where the initial (stationary) policy $\pi_{1,1}$ is chosen arbitrarily. Thus, iteration after iteration, the
non-stationary policy $\pi_{k,k}$ is made of more and more stationary policies, and this is why we refer to it as
having a growing period. We can prove the following performance bound for this algorithm:

Theorem 3. After $k$ iterations, the loss of running the non-stationary policy $\pi_{k,k}$ instead of the
optimal policy $\pi_*$ satisfies:

$$\|v_* - v_{\pi_{k,k}}\|_\infty \le \frac{2(\gamma-\gamma^k)}{1-\gamma}\,\epsilon + \gamma^{k-1}\|v_* - v_{\pi_{1,1}}\|_\infty + 2(k-1)\gamma^k V_{\max}.$$

When $k$ tends to infinity, this bound tends to $\frac{2\gamma}{1-\gamma}\epsilon$, and is thus again a factor $\frac{1}{1-\gamma}$ better than the
original API bound.
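A minimal Python sketch of this growing-period variation is given below, assuming a small random finite MDP with rewards in $[0, 1]$ (so that $V_{\max} = 1/(1-\gamma)$ is a valid bound on values), uniform evaluation errors, and an approximate policy evaluation obtained by iterating $T_{k,k}$. All names and parameter values are illustrative assumptions; the code only checks the bound of Theorem 3 on this instance.

```python
import random

def make_mdp(n_states, n_actions, seed=0):
    """Hypothetical random finite MDP with rewards in [0, 1]."""
    rng = random.Random(seed)
    P, R = [], []
    for _ in range(n_states):
        rows, rewards = [], []
        for _ in range(n_actions):
            w = [rng.random() + 1e-3 for _ in range(n_states)]
            total = sum(w)
            rows.append([x / total for x in w])
            rewards.append(rng.random())
        P.append(rows)
        R.append(rewards)
    return P, R

def backup(P, R, gamma, v, s, a):
    return R[s][a] + gamma * sum(p * x for p, x in zip(P[s][a], v))

def greedy(P, R, gamma, v):
    return [max(range(len(R[s])), key=lambda a: backup(P, R, gamma, v, s, a))
            for s in range(len(v))]

def periodic_value(P, R, gamma, policies, m, sweeps=400):
    """Approximate value of the periodic policy looping over the last m policies."""
    v = [0.0] * len(P)
    for _ in range(sweeps):
        for pi in policies[-m:]:
            v = [backup(P, R, gamma, v, s, pi[s]) for s in range(len(v))]
    return v

def api_growing_period(P, R, gamma, eps, n_iters, seed=0):
    """Return [pi_1, ..., pi_{n_iters}] produced by API with a growing period."""
    rng = random.Random(seed)
    policies = [[0] * len(P)]                                 # arbitrary initial policy pi_1
    for _ in range(n_iters - 1):
        k = len(policies)
        v_pol = periodic_value(P, R, gamma, policies, k)      # v_{pi_{k,k}}
        v_k = [x + rng.uniform(-eps, eps) for x in v_pol]     # v_k = v_{pi_{k,k}} + eps_k
        policies.append(greedy(P, R, gamma, v_k))             # pi_{k+1} in G(v_k)
    return policies

def sup_dist(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

gamma, eps, n_iters = 0.9, 0.05, 12
P, R = make_mdp(4, 3)
v_star = [0.0] * 4
for _ in range(2000):
    v_star = [max(backup(P, R, gamma, v_star, s, a) for a in range(3)) for s in range(4)]
v_max = 1.0 / (1.0 - gamma)
policies = api_growing_period(P, R, gamma, eps, n_iters)
d1 = sup_dist(v_star, periodic_value(P, R, gamma, policies[:1], 1))

checks = []
for k in (1, 4, n_iters):
    loss = sup_dist(v_star, periodic_value(P, R, gamma, policies[:k], k))
    bound = (2.0 * (gamma - gamma ** k) / (1.0 - gamma) * eps
             + gamma ** (k - 1) * d1 + 2.0 * (k - 1) * gamma ** k * v_max)
    checks.append((k, loss, bound))
```

The cost of each evaluation grows with $k$ because the period of the evaluated policy grows, which is the practical counterpart of the memory issue discussed later in the paper.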
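Proof of Theorem 3. Using the facts that $T_{k+1,k+1} v_{\pi_{k,k}} = T_{\pi_{k+1}} T_{k,k} v_{\pi_{k,k}} = T_{\pi_{k+1}} v_{\pi_{k,k}}$ and
$T_{\pi_{k+1}} v_k \ge T_{\pi_*} v_k$ (since $\pi_{k+1} \in \mathcal{G}(v_k)$), we have:

$$\begin{aligned}
v_* - v_{\pi_{k+1,k+1}}
&= T_{\pi_*} v_* - T_{k+1,k+1} v_{\pi_{k+1,k+1}} \\
&= T_{\pi_*} v_* - T_{\pi_*} v_{\pi_{k,k}} + T_{\pi_*} v_{\pi_{k,k}} - T_{k+1,k+1} v_{\pi_{k,k}} + T_{k+1,k+1} v_{\pi_{k,k}} - T_{k+1,k+1} v_{\pi_{k+1,k+1}} \\
&= \gamma P_{\pi_*}(v_* - v_{\pi_{k,k}}) + T_{\pi_*} v_{\pi_{k,k}} - T_{\pi_{k+1}} v_{\pi_{k,k}} + \Gamma_{k+1,k+1}(v_{\pi_{k,k}} - v_{\pi_{k+1,k+1}}) \\
&= \gamma P_{\pi_*}(v_* - v_{\pi_{k,k}}) + T_{\pi_*} v_k - T_{\pi_{k+1}} v_k + \gamma(P_{\pi_{k+1}} - P_{\pi_*})\epsilon_k + \Gamma_{k+1,k+1}(v_{\pi_{k,k}} - v_{\pi_{k+1,k+1}}) \\
&\le \gamma P_{\pi_*}(v_* - v_{\pi_{k,k}}) + \gamma(P_{\pi_{k+1}} - P_{\pi_*})\epsilon_k + \Gamma_{k+1,k+1}(v_{\pi_{k,k}} - v_{\pi_{k+1,k+1}}).
\end{aligned}$$

By taking the norm, and using the facts that $\|v_{\pi_{k,k}}\|_\infty \le V_{\max}$, $\|v_{\pi_{k+1,k+1}}\|_\infty \le V_{\max}$, and
$\|\Gamma_{k+1,k+1}\|_\infty = \gamma^{k+1}$, we get:

$$\|v_* - v_{\pi_{k+1,k+1}}\|_\infty \le \gamma\|v_* - v_{\pi_{k,k}}\|_\infty + 2\gamma\epsilon + 2\gamma^{k+1} V_{\max}.$$

Finally, by induction on $k$, we obtain:

$$\|v_* - v_{\pi_{k,k}}\|_\infty \le \frac{2(\gamma-\gamma^k)}{1-\gamma}\,\epsilon + \gamma^{k-1}\|v_* - v_{\pi_{1,1}}\|_\infty + 2(k-1)\gamma^k V_{\max}. \qquad □$$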
Though it has an improved asymptotic performance bound, the API algorithm we have just described
has two (related) drawbacks: 1) its finite iteration bound has a somewhat unsatisfactory term of the
form $2(k-1)\gamma^k V_{\max}$, and 2) even when there is no error (when $\epsilon = 0$), we cannot guarantee that,
similarly to standard Policy Iteration, it generates a sequence of policies of increasing values (it
is easy to see that in general, we do not have $v_{\pi_{k+1,k+1}} \ge v_{\pi_{k,k}}$). These two points motivate the
introduction of another API algorithm.

API with a non-stationary policy of fixed period. We consider now another variation of API
parameterized by $m \ge 1$, that iterates as follows for $k \ge m$:

$$\begin{cases} v_k \leftarrow v_{\pi_{k,m}} + \epsilon_k \\ \pi_{k+1} \leftarrow \text{any element of } \mathcal{G}(v_k) \end{cases}$$

where the initial non-stationary policy $\pi_{m,m}$ is built from a sequence of $m$ arbitrary stationary
policies $\pi_1, \pi_2, \ldots, \pi_m$. Unlike the previous API algorithm, the non-stationary policy $\pi_{k,m}$ here
only involves the last $m$ greedy stationary policies instead of all of them, and is thus of fixed period.
This is a strict generalization of the standard API algorithm, with which it coincides when $m = 1$.
For this algorithm, we can prove the following performance bound:

Theorem 4. For all $m$, for all $k \ge m$, the loss of running the non-stationary policy $\pi_{k,m}$ instead of
the optimal policy $\pi_*$ satisfies:

$$\|v_* - v_{\pi_{k,m}}\|_\infty \le \gamma^{k-m}\|v_* - v_{\pi_{m,m}}\|_\infty + \frac{2(\gamma-\gamma^{k+1-m})}{(1-\gamma)(1-\gamma^m)}\,\epsilon.$$

When $m = 1$ and $k$ tends to infinity, we recover exactly the bound of Theorem 1. When $m > 1$
and $k$ tends to infinity, this bound coincides with that of Theorem 2 for our non-stationary version
of AVI: it is a factor $\frac{1-\gamma^m}{1-\gamma}$ better than the standard bound of Theorem 1.
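The fixed-period variation can be sketched in Python as follows, again assuming a small random finite MDP, uniform evaluation errors bounded by $\epsilon$, and approximate evaluation of $v_{\pi_{k,m}}$ by iterating $T_{k,m}$. The helper names and parameter values are hypothetical; the sketch simply compares the observed loss with the bound of Theorem 4 for a few periods $m$.

```python
import random

def make_mdp(n_states, n_actions, seed=0):
    """Hypothetical random finite MDP with rewards in [0, 1]."""
    rng = random.Random(seed)
    P, R = [], []
    for _ in range(n_states):
        rows, rewards = [], []
        for _ in range(n_actions):
            w = [rng.random() + 1e-3 for _ in range(n_states)]
            total = sum(w)
            rows.append([x / total for x in w])
            rewards.append(rng.random())
        P.append(rows)
        R.append(rewards)
    return P, R

def backup(P, R, gamma, v, s, a):
    return R[s][a] + gamma * sum(p * x for p, x in zip(P[s][a], v))

def greedy(P, R, gamma, v):
    return [max(range(len(R[s])), key=lambda a: backup(P, R, gamma, v, s, a))
            for s in range(len(v))]

def periodic_value(P, R, gamma, policies, m, sweeps=400):
    """Approximate value of the periodic policy looping over the last m policies."""
    v = [0.0] * len(P)
    for _ in range(sweeps):
        for pi in policies[-m:]:
            v = [backup(P, R, gamma, v, s, pi[s]) for s in range(len(v))]
    return v

def api_fixed_period(P, R, gamma, eps, m, k, seed=0):
    """Return [pi_1, ..., pi_k]; the first m policies are arbitrary (here random)."""
    rng = random.Random(seed)
    n, n_actions = len(P), len(R[0])
    policies = [[rng.randrange(n_actions) for _ in range(n)] for _ in range(m)]
    while len(policies) < k:
        v_pol = periodic_value(P, R, gamma, policies, m)      # v_{pi_{k,m}}
        v_k = [x + rng.uniform(-eps, eps) for x in v_pol]     # v_k = v_{pi_{k,m}} + eps_k
        policies.append(greedy(P, R, gamma, v_k))             # pi_{k+1} in G(v_k)
    return policies

def sup_dist(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

gamma, eps, k = 0.9, 0.05, 25
P, R = make_mdp(4, 3)
v_star = [0.0] * 4
for _ in range(2000):
    v_star = [max(backup(P, R, gamma, v_star, s, a) for a in range(3)) for s in range(4)]

checks = []
for m in (1, 3, 6):
    policies = api_fixed_period(P, R, gamma, eps, m, k)
    d_init = sup_dist(v_star, periodic_value(P, R, gamma, policies[:m], m))
    loss = sup_dist(v_star, periodic_value(P, R, gamma, policies, m))
    bound = (gamma ** (k - m) * d_init
             + 2.0 * (gamma - gamma ** (k + 1 - m)) / ((1.0 - gamma) * (1.0 - gamma ** m)) * eps)
    checks.append((m, loss, bound))
```

With `m = 1` the loop reduces to standard approximate Policy Iteration, in line with the remark that the algorithm is a strict generalization of API.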

The rest of this section develops the proof of this performance bound. A central argument of our 
proof is the following lemma, which shows that similarly to the standard API, our new algorithm 
has an (approximate) policy improvement property. 

Lemma 2. At each iteration of the algorithm, the value $v_{\pi_{k+1,m}}$ of the non-stationary policy

$$\pi_{k+1,m} = \pi_{k+1}\ \pi_k \cdots \pi_{k+2-m}\ \pi_{k+1}\ \pi_k \cdots \pi_{k-m+2} \cdots$$

cannot be much worse than the value $v_{\pi'_{k,m}}$ of the non-stationary policy

$$\pi'_{k,m} = \pi_{k-m+1}\ \pi_k \cdots \pi_{k+2-m}\ \pi_{k-m+1}\ \pi_k \cdots \pi_{k-m+2} \cdots$$

in the precise following sense:

$$v_{\pi_{k+1,m}} \ge v_{\pi'_{k,m}} - \frac{2\gamma}{1-\gamma^m}\,\epsilon.$$
The policy $\pi'_{k,m}$ differs from $\pi_{k+1,m}$ in that every $m$ steps, it chooses the oldest policy $\pi_{k-m+1}$
instead of the newest one $\pi_{k+1}$. Also $\pi'_{k,m}$ is related to $\pi_{k,m}$ as follows: $\pi'_{k,m}$ takes the first action
according to $\pi_{k-m+1}$ and then runs $\pi_{k,m}$; equivalently, since $\pi_{k,m}$ loops over $\pi_k\ \pi_{k-1} \cdots \pi_{k-m+1}$,
$\pi'_{k,m} = \pi_{k-m+1}\ \pi_{k,m}$ can be seen as a 1-step right rotation of $\pi_{k,m}$. When there is no error (when
$\epsilon = 0$), this shows that the new policy $\pi_{k+1,m}$ is better than a "rotation" of $\pi_{k,m}$. When $m = 1$,
$\pi_{k+1,m} = \pi_{k+1}$ and $\pi'_{k,m} = \pi_k$ and we thus recover the well-known (approximate) policy
improvement theorem for standard API (see for instance [4, Lemma 6.1]).

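Proof of Lemma 2. Since $\pi'_{k,m}$ takes the first action with respect to $\pi_{k-m+1}$ and then runs $\pi_{k,m}$, we
have $v_{\pi'_{k,m}} = T_{\pi_{k-m+1}} v_{\pi_{k,m}}$. Now, since $\pi_{k+1} \in \mathcal{G}(v_k)$, we have $T_{\pi_{k+1}} v_k \ge T_{\pi_{k-m+1}} v_k$ and

$$\begin{aligned}
v_{\pi'_{k,m}} - v_{\pi_{k+1,m}}
&= T_{\pi_{k-m+1}} v_{\pi_{k,m}} - v_{\pi_{k+1,m}} \\
&= T_{\pi_{k-m+1}} v_k - \gamma P_{\pi_{k-m+1}}\epsilon_k - v_{\pi_{k+1,m}} \\
&\le T_{\pi_{k+1}} v_k - \gamma P_{\pi_{k-m+1}}\epsilon_k - v_{\pi_{k+1,m}} \\
&= T_{\pi_{k+1}} v_{\pi_{k,m}} + \gamma(P_{\pi_{k+1}} - P_{\pi_{k-m+1}})\epsilon_k - v_{\pi_{k+1,m}} \\
&= T_{\pi_{k+1}} T_{k,m} v_{\pi_{k,m}} - T_{k+1,m} v_{\pi_{k+1,m}} + \gamma(P_{\pi_{k+1}} - P_{\pi_{k-m+1}})\epsilon_k \\
&= T_{k+1,m} T_{\pi_{k-m+1}} v_{\pi_{k,m}} - T_{k+1,m} v_{\pi_{k+1,m}} + \gamma(P_{\pi_{k+1}} - P_{\pi_{k-m+1}})\epsilon_k \\
&= \Gamma_{k+1,m}(T_{\pi_{k-m+1}} v_{\pi_{k,m}} - v_{\pi_{k+1,m}}) + \gamma(P_{\pi_{k+1}} - P_{\pi_{k-m+1}})\epsilon_k \\
&= \Gamma_{k+1,m}(v_{\pi'_{k,m}} - v_{\pi_{k+1,m}}) + \gamma(P_{\pi_{k+1}} - P_{\pi_{k-m+1}})\epsilon_k,
\end{aligned}$$

from which we deduce that:

$$v_{\pi'_{k,m}} - v_{\pi_{k+1,m}} \le (I - \Gamma_{k+1,m})^{-1}\,\gamma(P_{\pi_{k+1}} - P_{\pi_{k-m+1}})\epsilon_k$$

and the result follows by using the facts that $\|\epsilon_k\|_\infty \le \epsilon$ and $\|(I - \Gamma_{k+1,m})^{-1}\|_\infty = \frac{1}{1-\gamma^m}$. □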
We are now ready to prove the main result of this section. 

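Proof of Theorem 4. Using the facts that 1) $T_{k+1,m+1} v_{\pi_{k,m}} = T_{\pi_{k+1}} T_{k,m} v_{\pi_{k,m}} = T_{\pi_{k+1}} v_{\pi_{k,m}}$ and
2) $T_{\pi_{k+1}} v_k \ge T_{\pi_*} v_k$ (since $\pi_{k+1} \in \mathcal{G}(v_k)$), we have for $k \ge m$,

$$\begin{aligned}
v_* - v_{\pi_{k+1,m}}
&= T_{\pi_*} v_* - T_{k+1,m} v_{\pi_{k+1,m}} \\
&= T_{\pi_*} v_* - T_{\pi_*} v_{\pi_{k,m}} + T_{\pi_*} v_{\pi_{k,m}} - T_{k+1,m+1} v_{\pi_{k,m}} + T_{k+1,m+1} v_{\pi_{k,m}} - T_{k+1,m} v_{\pi_{k+1,m}} \\
&= \gamma P_{\pi_*}(v_* - v_{\pi_{k,m}}) + T_{\pi_*} v_{\pi_{k,m}} - T_{\pi_{k+1}} v_{\pi_{k,m}} + \Gamma_{k+1,m}(T_{\pi_{k-m+1}} v_{\pi_{k,m}} - v_{\pi_{k+1,m}}) \\
&\le \gamma P_{\pi_*}(v_* - v_{\pi_{k,m}}) + T_{\pi_*} v_k - T_{\pi_{k+1}} v_k + \gamma(P_{\pi_{k+1}} - P_{\pi_*})\epsilon_k + \Gamma_{k+1,m}(T_{\pi_{k-m+1}} v_{\pi_{k,m}} - v_{\pi_{k+1,m}}) \\
&\le \gamma P_{\pi_*}(v_* - v_{\pi_{k,m}}) + \gamma(P_{\pi_{k+1}} - P_{\pi_*})\epsilon_k + \Gamma_{k+1,m}(T_{\pi_{k-m+1}} v_{\pi_{k,m}} - v_{\pi_{k+1,m}}). \qquad (6)
\end{aligned}$$

Consider the policy $\pi'_{k,m}$ defined in Lemma 2. Observing as in the beginning of the proof of
Lemma 2 that $T_{\pi_{k-m+1}} v_{\pi_{k,m}} = v_{\pi'_{k,m}}$, Equation (6) can be rewritten as follows:

$$v_* - v_{\pi_{k+1,m}} \le \gamma P_{\pi_*}(v_* - v_{\pi_{k,m}}) + \gamma(P_{\pi_{k+1}} - P_{\pi_*})\epsilon_k + \Gamma_{k+1,m}(v_{\pi'_{k,m}} - v_{\pi_{k+1,m}}).$$

By using the facts that $v_* \ge v_{\pi_{k,m}}$, $v_* \ge v_{\pi_{k+1,m}}$ and Lemma 2, we get

$$\begin{aligned}
\|v_* - v_{\pi_{k+1,m}}\|_\infty
&\le \gamma\|v_* - v_{\pi_{k,m}}\|_\infty + 2\gamma\epsilon + \frac{\gamma^m(2\gamma\epsilon)}{1-\gamma^m} \\
&= \gamma\|v_* - v_{\pi_{k,m}}\|_\infty + \frac{2\gamma}{1-\gamma^m}\,\epsilon.
\end{aligned}$$

Finally, we obtain by induction that for all $k \ge m$,

$$\|v_* - v_{\pi_{k,m}}\|_\infty \le \gamma^{k-m}\|v_* - v_{\pi_{m,m}}\|_\infty + \frac{2(\gamma-\gamma^{k+1-m})}{(1-\gamma)(1-\gamma^m)}\,\epsilon. \qquad □$$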
6 Discussion, conclusion and future work

We recalled in Theorem 1 the standard performance bound when computing an approximately optimal
stationary policy with the standard AVI and API algorithms. After arguing that this bound is
tight (in particular by providing an original argument for AVI), we proposed three new dynamic
programming algorithms (one based on AVI and two on API) that output non-stationary policies for
which the performance bound can be significantly reduced (by a factor $\frac{1}{1-\gamma}$).

From a bibliographical point of view, it is the work of [14] that made us think that non-stationary
policies may lead to better performance bounds. In that work, the author considers problems with
a finite-horizon $T$ for which one computes non-stationary policies with performance bounds in
$O(T\epsilon)$, and infinite-horizon problems for which one computes stationary policies with performance
bounds in $O\!\left(\frac{\epsilon}{(1-\gamma)^2}\right)$. Using the informal equivalence of the horizons $T \simeq \frac{1}{1-\gamma}$, one sees that
non-stationary policies look better than stationary policies. In [14], non-stationary policies are only
computed in the context of finite-horizon (and thus non-stationary) problems; the fact that
non-stationary policies can also be useful in an infinite-horizon stationary context is to our knowledge
completely new.

The best performance improvements are obtained when our algorithms consider periodic
non-stationary policies of which the period grows to infinity, and thus require an infinite memory, which
may look like a practical limitation. However, in two of the proposed algorithms, a parameter $m$
allows one to make a trade-off between the quality of approximation $\frac{2\gamma}{(1-\gamma^m)(1-\gamma)}\epsilon$ and the amount of
memory $O(m)$ required. In practice, it is easy to see that by choosing $m = \left\lceil \frac{1}{1-\gamma} \right\rceil$, that is a
memory that scales linearly with the horizon (and thus the difficulty) of the problem, one can get a
performance bound of $\frac{2\gamma}{(1-e^{-1})(1-\gamma)}\epsilon \le \frac{3.164\,\gamma}{1-\gamma}\epsilon$.
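The arithmetic behind this choice of $m$ can be checked with a few lines of Python. The sketch below, in which the function name and the sample values of $\gamma$ are arbitrary, computes $m = \lceil 1/(1-\gamma) \rceil$, the resulting constant $\frac{2\gamma}{(1-\gamma^m)(1-\gamma)}$, and the limiting constant $\frac{2\gamma}{(1-e^{-1})(1-\gamma)}$; the comparison relies on the inequality $\gamma^{1/(1-\gamma)} \le e^{-1}$, which follows from $\ln\gamma \le \gamma - 1$.

```python
import math

def memory_and_constant(gamma):
    """Period m = ceil(1 / (1 - gamma)) and the constants multiplying eps."""
    m = math.ceil(1.0 / (1.0 - gamma))
    constant = 2.0 * gamma / ((1.0 - gamma ** m) * (1.0 - gamma))
    limit = 2.0 * gamma / ((1.0 - math.exp(-1.0)) * (1.0 - gamma))
    return m, constant, limit

# Each row is (gamma, m, constant obtained with period m, constant with gamma^m replaced by 1/e).
rows = [(g,) + memory_and_constant(g) for g in (0.5, 0.9, 0.99, 0.999)]
```

For comparison, the constant of the stationary bound of Theorem 1 is $\frac{2\gamma}{(1-\gamma)^2}$, which is larger than these values by a factor of order $\frac{1}{1-\gamma}$.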

We conjecture that our asymptotic bound of $\frac{2\gamma}{1-\gamma}\epsilon$, and the non-asymptotic bounds of Theorems 2
and 4 are tight. The actual proof of this conjecture is left for future work. Important recent works
of the literature involve studying performance bounds when the errors are controlled in $L_p$ norms
instead of max-norm [19, 20, 21, 1, 8, 18, 17], which is natural when supervised learning algorithms
are used to approximate the evaluation steps of AVI and API. Since our proofs are based on
componentwise bounds like those of the pioneering works on this topic [19, 20], we believe that the extension
of our analysis to $L_p$ norm analysis is straightforward. Last but not least, an important research
direction that we plan to follow consists in revisiting the many implementations of AVI and API for
building stationary policies (see the list in the introduction), turning them into algorithms that look for
non-stationary policies, and studying them precisely, analytically as well as empirically.